Chapter · Backend Systems

Mastering Concurrency & Parallelism

Every backend system you will ever build shares one non-negotiable requirement: it must handle multiple things at once. A web server that can only process one request at a time forces every other user to wait — and that simply doesn't work in production. This guide builds the complete mental model from the ground up — from raw CPU cycles and IO operations, through OS threads and event loops, to modern concurrency primitives like goroutines and async/await. Everything is grounded in practical backend code using Go and Python.

Why Concurrency Matters

When we talk about a "backend application" in this context, we mean an HTTP-based web server — a process running on a machine, listening on a port, receiving requests from browsers and other clients, and sending responses back.

If your server can only process one request at a time, then every other user — potentially thousands of them — must either wait in line or receive an error. That is an impossible situation for any real production application.

Key Insight

Understanding how your server does multiple things at once is not just "nice to know" — it shapes how you debug issues, structure your application, and make architectural decisions.

Most of us learn keywords like async, await, and threading from our programming language's documentation. We know they "make things concurrent." But we rarely understand what these keywords actually do behind the scenes — at a mechanical level, what happens in the CPU and the operating system.

The Cost of Sitting Idle

Let's start with a concrete scenario. You have a browser making a request to your backend. The backend needs to query a database, wait for the response, and return data to the client. That's the typical lifecycle of an API call.

How long does a database query take?

The waiting time depends on where your database lives relative to your server:

Scenario	Latency	Description
Localhost	1–2 ms	Server and database on the same machine
Same region, different AZ	20–30 ms	Typical production setup (e.g. AWS us-east-1a → us-east-1b)
Cross-region	90–100 ms	Database in a geographically distant region

During that waiting time — whether 2 ms or 100 ms — if your server is processing synchronously (one request at a time, line by line), your CPU is doing absolutely nothing. It is completely idle.

Quantifying the waste

A modern CPU executes roughly 3 billion instructions per second, which is about 3 million instructions per millisecond. If your server sits idle for 100 ms waiting on a database response, it could have executed 300 million instructions — but instead it executed zero.

Fig 1 — A synchronous server wastes the vast majority of its CPU cycles waiting for IO
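The arithmetic above can be checked in a few lines. This is a back-of-the-envelope sketch: the 3 billion instructions per second figure is the rough estimate from the text, and the function name is made up for illustration.

```python
# Rough estimate from the text: ~3 billion instructions per second.
INSTRUCTIONS_PER_SECOND = 3_000_000_000

def wasted_instructions(idle_ms: int) -> int:
    """Instructions the CPU could have executed while idle for idle_ms milliseconds."""
    per_ms = INSTRUCTIONS_PER_SECOND // 1000   # 3 million per millisecond
    return per_ms * idle_ms

for latency_ms in (2, 30, 100):                # localhost, cross-AZ, cross-region
    print(f"{latency_ms:>3} ms idle = {wasted_instructions(latency_ms):,} instructions not executed")
```

Real CPUs retire a variable number of instructions per cycle, so treat the output as an order-of-magnitude figure rather than a measurement.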

A realistic API call breakdown

A typical mid-to-complex API call involves 3–5 database queries plus 1–2 external service calls (email, a cache like Redis, etc.). If a request makes five such network operations and each takes an average of 50 ms, the total time spent waiting for IO is around 250 ms. The actual CPU computation (validation, JSON parsing, serialization) is only about 10 ms. That means your server's hardware is idle for roughly 96% of the request (250 ms out of 260 ms). This is exactly the problem concurrency solves.
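The breakdown above reduces to a single ratio. The sketch below uses the figures from the paragraph (five 50 ms waits, 10 ms of CPU work); the helper name is an illustrative assumption.

```python
def idle_fraction(io_waits_ms: list[float], cpu_ms: float) -> float:
    """Share of a synchronous request's wall-clock time spent waiting on IO."""
    io_ms = sum(io_waits_ms)
    return io_ms / (io_ms + cpu_ms)

# Five network operations at ~50 ms each, plus ~10 ms of CPU work.
fraction = idle_fraction([50, 50, 50, 50, 50], cpu_ms=10)
print(f"idle for {fraction:.0%} of the request")
```

With these inputs the idle share comes out in the mid-90s percent range, which is the point of the paragraph: almost all of the request's lifetime is waiting.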

IO-Bound vs CPU-Bound

Every operation your backend performs falls into one of two categories. This distinction is the most important concept in this entire guide — it drives every decision about which concurrency model to use.

IO-Bound operations

IO stands for Input/Output: any operation where the CPU cannot do anything useful because it is waiting for something external. The "I" refers to data coming into your process; the "O" refers to data going out. Per the Linux read(2) man page, a read on a file descriptor such as a socket or pipe blocks the calling thread by default until data arrives; if the descriptor is in non-blocking mode (O_NONBLOCK), the call instead fails immediately with EAGAIN.

Common IO-bound operations in backend systems:

Database queries — sending a SQL query over TCP, waiting for rows to come back
External API calls — HTTP requests to third-party services (payment gateways, email providers)
File system operations — reading config files, writing uploads, temporary file handling
Cache interactions — Redis GET/SET operations over the network
Standard I/O — logging to stdout, reading from stdin
Message queues — publishing to or consuming from Kafka, RabbitMQ, etc.

CPU-Bound operations

CPU-bound work is when the processor is actively crunching data — it's executing instructions as fast as it can, and the bottleneck is the CPU itself, not any external resource. Most CPU-bound work in a typical backend is lightweight (1–2 ms for JSON parsing or validation), but some operations are heavily CPU-bound:

Image/video processing — matrix multiplications, convolutions, color space transforms
Encryption / hashing — JWT verification, bcrypt password hashing, TLS handshakes
Compression — gzip, zstd, Brotli encoding of response bodies
Serialization — large-scale JSON/Protobuf marshal/unmarshal
ML inference — running a model forward pass without GPU offloading

The Rule of Thumb

More than 70% of the time in a typical backend application is spent on IO. For IO-bound work, concurrency is mandatory. For CPU-bound work, parallelism (multiple cores) gives the speedup.

Fig 2 — CPU work (red) is dwarfed by IO waiting (orange) in a typical API call
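One way to see the IO-bound versus CPU-bound distinction is to compare wall-clock time with CPU time for each kind of task. In this sketch, time.sleep() stands in for waiting on a database and a plain loop stands in for real computation; both stand-ins and the helper name are assumptions for illustration.

```python
import time

def measure(task):
    """Return (wall_seconds, cpu_seconds) consumed by task()."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    task()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start

def io_bound():
    time.sleep(0.2)               # stand-in for waiting on a database or API

def cpu_bound():
    total = 0
    for i in range(2_000_000):    # stand-in for hashing, compression, etc.
        total += i * i

io_wall, io_cpu = measure(io_bound)
cpu_wall, cpu_cpu = measure(cpu_bound)
print(f"IO-bound : wall={io_wall:.3f}s cpu={io_cpu:.3f}s")
print(f"CPU-bound: wall={cpu_wall:.3f}s cpu={cpu_cpu:.3f}s")
```

The IO-bound task should show a wall time near 0.2 s with almost no CPU time, while the CPU-bound task's CPU time should track its wall time closely.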

Concurrency vs Parallelism

These two terms are often confused — and for good reason, since they both involve "doing multiple things." But the distinction is critical.

Parallelism: doing multiple things at the same time

A program is parallel when it executes multiple instructions at the same moment. This requires hardware support: at minimum two CPU cores, because a single core can only execute one instruction at one instant. If you have 4 cores, you can run 4 instructions truly simultaneously.

Concurrency: dealing with multiple things at once

Concurrency is about structuring your program so that multiple tasks can be started, paused, and resumed, even on a single core. From the outside, it looks like everything is happening at once. But zoom into any single moment, and only one thing is actually executing on a given core. The key is that when one task is blocked (waiting for IO), another task gets the CPU.

The Classic One-Liner

Parallelism is about doing multiple things at once.
Concurrency is about dealing with multiple things at once.

— Attributed to Rob Pike, co-creator of Go

Fig 3 — Concurrency interleaves tasks on one core; Parallelism runs them on separate cores simultaneously
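Interleaving on a single thread can be made visible with a small asyncio sketch. Two tasks take turns on one thread, and each `await asyncio.sleep(0)` is a pause point where the other task gets to run. The task names and step counts are arbitrary.

```python
import asyncio

log = []

async def task(name: str, steps: int):
    for step in range(1, steps + 1):
        log.append(f"{name}{step}")   # "work" on the single thread
        await asyncio.sleep(0)        # pause point: let the other task run

async def main():
    await asyncio.gather(task("A", 3), task("B", 3))

asyncio.run(main())
print(log)
```

Running it prints `['A1', 'B1', 'A2', 'B2', 'A3', 'B3']`: both tasks make progress, but at any instant only one of them is executing. That is concurrency without parallelism.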

OS Threads

The first fundamental mechanism for doing multiple things at once is the thread. A thread is an independent unit of execution managed by the operating system. It is not a feature of your programming language; it's a feature of the OS kernel. When you "create a thread," you call a library function such as POSIX pthread_create(), which on Linux issues the clone() system call to ask the kernel to set up a new execution context.

What the OS creates for each thread

When a new thread is created, the kernel allocates several things for it:

A stack — used for tracking function calls (call frames), local variables, and return addresses. On Linux, the default stack size is 8 MB of virtual memory (though physical pages are only allocated on demand via page faults).
An instruction pointer (IP) — tracks where exactly the thread is in its code, so the kernel can resume it after a context switch.
Kernel data structures — metadata for the scheduler: thread state (running / sleeping / blocked), priority, CPU affinity, signal masks, etc.

Reference

pthreads(7): Linux man page
clone(2): the system call behind thread creation
Python threading module documentation

Preemptive scheduling

The OS scheduler decides which thread gets CPU time and for how long. It assigns each thread a time slice (typically a few milliseconds). After that slice expires, the scheduler forcibly pauses the thread — whether it's finished or not — saves its state, and switches to another thread. This is called preemptive scheduling because threads are preempted (interrupted) without their cooperation.

Shared memory between threads

An important property: threads within the same process share the same memory space (heap, global variables, file descriptors). If thread 1 allocates an object on the heap, thread 2 can access it via a pointer. This makes inter-thread communication fast (no copying, no serialization) — but also dangerous, as we'll see in the race conditions section.

Threads across different processes, however, are fully isolated — they cannot see each other's memory. This isolation is enforced by the kernel's virtual memory system for security and stability.
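A minimal sketch of the shared-memory property: several threads append to the same list object, and the main thread sees every change without any copying. The names are illustrative.

```python
import threading

shared_log = []   # lives on the process heap, visible to every thread

def worker(name: str):
    # Same list object as in the main thread; nothing is copied or serialized.
    shared_log.append(name)

threads = [threading.Thread(target=worker, args=(f"t{i}",)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(shared_log))
```

This particular example is safe because each thread performs a single append. Multi-step updates on shared data are where the trouble starts.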

Thread-per-request in Python

# Python: threading model (simplified)
# parse_http, db and server_socket are assumed to be defined elsewhere.
import json
import threading

def handle_request(conn):
    # 1. Parse the HTTP request (CPU work)
    request = parse_http(conn)

    # 2. Database query: this BLOCKS the thread
    #    The OS scheduler will switch to another thread
    user = db.execute(
        "SELECT * FROM users WHERE id = %s",
        (request.user_id,)
    )

    # 3. Thread resumes here after DB responds
    response = json.dumps(user)
    conn.sendall(response.encode())   # sockets send bytes, not str

# Create a new OS thread for each incoming connection
while True:
    conn, addr = server_socket.accept()   # accept() returns (socket, address)
    t = threading.Thread(target=handle_request, args=(conn,))
    t.start()

Notice there's no async or await anywhere. The blocking happens transparently — at the OS level — when db.execute() makes a network call. The thread is marked as "blocked," the OS switches to another thread, and later wakes this thread up when the database response arrives on the socket.
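The transparent blocking described above can be observed without a real database. In this sketch a socket pair stands in for the database connection (an assumption for illustration): the handler thread blocks in recv() while the main thread keeps running, and wakes up once data arrives.

```python
import socket
import threading

events = []
server_end, client_end = socket.socketpair()   # stand-in for a DB connection

def handle_request():
    events.append("handler: waiting for data")
    data = server_end.recv(1024)                # blocks this thread only
    events.append(f"handler: got {data!r}")

t = threading.Thread(target=handle_request)
t.start()

# The main thread keeps running while the handler is blocked in recv().
events.append("main: still free to do other work")
client_end.sendall(b"row data")                 # the "database" responds
t.join()

server_end.close()
client_end.close()
print("\n".join(events))
```

The relative order of the first two events depends on thread start-up timing, but the handler's "got" line always comes last, after the main thread has done its own work.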

Thread Overhead & Context Switching

Threads are powerful, but they come with three categories of overhead that limit how many you can practically run.

1. Memory overhead

Each thread gets its own stack. On Linux the default is 8 MB of virtual address space (though only the touched pages consume physical RAM). Even at a conservative 500 KB–1 MB of actual physical memory per thread, 10,000 simultaneous connections would consume 5–10 GB of RAM just for thread stacks — before your application even allocates a single data structure.
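The memory estimate is simple multiplication. This sketch uses the per-thread figures quoted above and decimal gigabytes; the function name is illustrative.

```python
def stack_memory_gb(threads: int, physical_kb_per_thread: int) -> float:
    """Physical RAM used by thread stacks, in decimal gigabytes."""
    return threads * physical_kb_per_thread / 1_000_000

for kb in (500, 1000):
    print(f"10,000 threads x {kb} KB = {stack_memory_gb(10_000, kb):.0f} GB")
```

Actual resident memory per thread depends on how deep the call stacks go, so the per-thread figure is the main uncertainty in this estimate.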

2. Creation overhead

Creating a thread requires a system call to the kernel. The kernel must allocate the stack, set up internal data structures (thread descriptor, signal state, scheduler metadata), and register the thread with the scheduler. This takes anywhere from a few microseconds to a few milliseconds — fast by human standards, but significant at scale when you're creating and destroying thousands of threads.

3. Context switch overhead

This is the big one. Every time the scheduler switches from one thread to another, it must:

Save the current thread's CPU registers (general-purpose, floating-point, SIMD)
Update bookkeeping structures (thread state, time accounting)
Select the next thread to run (involves scheduler algorithm)
Restore the saved registers of the new thread
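Lose cache warmth: the new thread's data is usually not in L1/L2 yet, so its first memory accesses are slow (switching between processes can also invalidate TLB entries)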

Each context switch costs 1–10 microseconds on modern hardware. With 4 cores and 1,000 threads, the scheduler must constantly juggle between them, and this overhead can become a measurable source of latency. The context switch time is purely unproductive — it's maintenance work, not real computation.

Fig 4 — Context switching is pure overhead: no useful work happens during the switch
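To get a feel for when switching becomes measurable, the per-switch cost can be multiplied by a switch rate. The 1–10 µs range comes from the text; the rate of 20,000 switches per second per core is purely an illustrative assumption, as is the function name.

```python
def switch_overhead(switches_per_second: int, cost_us: float) -> float:
    """Fraction of one core's time spent on context switches."""
    return switches_per_second * cost_us / 1_000_000

# ASSUMPTION: 20,000 switches per second per core is an illustrative rate,
# not a figure from the text; the 1-10 microsecond cost range is.
for cost_us in (1, 5, 10):
    share = switch_overhead(20_000, cost_us)
    print(f"{cost_us:>2} us per switch -> {share:.0%} of a core")
```

The direct cost is only part of the picture, since the cold caches after each switch slow down the useful work that follows.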

Why thread-per-request struggles

This is exactly why the "one thread per request" model breaks down under high concurrency. 10,000 concurrent connections = 10,000 threads = gigabytes of stack memory + constant context switching overhead. The C10k problem was the industry's realization that this model simply doesn't scale.

The Event Loop Model

The event loop takes a radically different approach to concurrency: instead of many threads, it uses one thread and a loop that continuously checks for completed IO operations, then runs the appropriate callback. This is the model that powers Node.js, Python's asyncio, Nginx, and many other high-performance systems.

The core mechanism

When a task (like serving Request A) needs to make a database query, instead of blocking and waiting, it:

Initiates the IO — sends the query bytes over the TCP socket to the database.
Registers a callback — tells the event loop: "When the database responds on this socket, run this function."
Yields control — returns immediately, freeing the single thread to pick up the next task.

The event loop itself is essentially an infinite loop. On each iteration, it uses an OS-level mechanism to check which sockets have data ready — then it runs the callbacks for those completed operations and goes back to checking.

OS-level IO multiplexing

The event loop doesn't implement IO monitoring itself — it relies on the operating system. The OS provides special system calls that can efficiently monitor thousands of file descriptors (sockets, files, pipes) simultaneously:

OS	Mechanism	Notes
Linux	`epoll`	O(1) per ready event. Scales to millions of connections. See epoll(7)
macOS / BSD	`kqueue`	Similar to epoll but with a different API. See kqueue(2)
Windows	`IOCP`	IO Completion Ports — Windows' equivalent

Under the hood, Node.js uses libuv, which abstracts these platform-specific mechanisms. Python's asyncio builds its default Unix event loop on the selectors module, which wraps epoll/kqueue.

Fig 5 — The event loop checks IO status on each iteration and runs callbacks for completed operations
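The mechanism can be sketched in miniature with Python's selectors module. This is a toy loop, not how asyncio or libuv are actually written: two socket pairs stand in for in-flight database queries, and a callback is registered for each one.

```python
import selectors
import socket

selector = selectors.DefaultSelector()   # picks epoll, kqueue, etc. per platform
responses = []

def on_response(sock):
    """Callback: runs only once the selector reports the socket readable."""
    responses.append(sock.recv(1024))
    selector.unregister(sock)
    sock.close()

# Two socket pairs stand in for two in-flight database queries.
pairs = [socket.socketpair() for _ in range(2)]
for ours, _theirs in pairs:
    ours.setblocking(False)
    selector.register(ours, selectors.EVENT_READ, data=on_response)

# The "database" answers out of order.
pairs[1][1].sendall(b"orders")
pairs[0][1].sendall(b"user")

# The event loop: ask the OS which sockets are ready, run their callbacks.
for _ in range(10):                      # bounded so the sketch cannot spin forever
    if not selector.get_map():           # nothing left to wait for
        break
    for key, _events in selector.select(timeout=1.0):
        key.data(key.fileobj)            # key.data is the registered callback

for _ours, theirs in pairs:
    theirs.close()
selector.close()
print(sorted(responses))
```

A single thread services both "queries" here: it never blocks on one particular socket, only inside select(), which returns as soon as any registered socket has data.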

Why event loops are efficient for IO

Since there's only one thread, there is no context switching overhead, no multi-megabyte stack allocations, and no scheduler juggling. The CPU spends its time either running your actual code (callbacks) or waiting efficiently inside epoll_wait(). This is why Node.js can handle tens of thousands of concurrent connections on modest hardware.
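A rough way to see this efficiency is to simulate many waiting requests on one thread. In the sketch below, asyncio.sleep(0.1) stands in for a 100 ms database round trip, and the count of 10,000 is an arbitrary choice for illustration.

```python
import asyncio
import time

async def fake_request(i: int) -> int:
    await asyncio.sleep(0.1)     # stand-in for a 100 ms database round trip
    return i

async def main(n: int):
    start = time.perf_counter()
    results = await asyncio.gather(*(fake_request(i) for i in range(n)))
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(main(10_000))
print(f"{len(results)} requests in {elapsed:.2f}s on one thread")
```

Handled one at a time, the same waits would add up to about 1,000 seconds. Because all of them overlap, the total stays close to a single wait plus scheduling overhead.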

The critical trade-off: never block the loop

If you run a CPU-intensive operation — say, a 100 ms image processing task — directly on the event loop thread, everything stops. No other requests can be served, no callbacks can fire, no IO completions can be processed. The entire server freezes until that CPU work finishes. This is why event-loop-based languages like JavaScript insist on non-blocking code and offload heavy computation to worker threads or separate processes.
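The freeze is easy to reproduce. In this sketch a background "ticker" coroutine counts how often the loop gets to run while some blocking work is in progress, once inline and once offloaded with asyncio.to_thread (Python 3.9+). time.sleep() is used as a stand-in for heavy work. In CPython, genuinely CPU-bound code would more typically go to a process pool because of the GIL.

```python
import asyncio
import time

def heavy_work():
    time.sleep(0.2)   # stand-in for ~200 ms of blocking work

async def count_ticks(offload: bool) -> int:
    ticks = 0

    async def ticker():
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    ticker_task = asyncio.create_task(ticker())
    await asyncio.sleep(0)                    # let the ticker start
    if offload:
        await asyncio.to_thread(heavy_work)   # the loop stays free to tick
    else:
        heavy_work()                          # the loop is frozen meanwhile
    seen = ticks
    ticker_task.cancel()
    await asyncio.gather(ticker_task, return_exceptions=True)
    return seen

blocked = asyncio.run(count_ticks(offload=False))
offloaded = asyncio.run(count_ticks(offload=True))
print(f"ticks while blocking inline: {blocked}, while offloaded: {offloaded}")
```

Inline, the ticker never runs during the blocking call, so the count is zero. Offloaded, the loop keeps servicing the ticker while the work proceeds on another thread.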

Callbacks → Promises → async/await: the evolution

The event loop works on callbacks — but raw callbacks lead to deeply nested code (the infamous "callback hell"). The evolution in JavaScript went:

// Stage 1: Raw callbacks (pre-ES6)
db.query("SELECT * FROM users WHERE id = ?", [userId],
  function(err, result) {
    if (err) return sendError(res, err);
    // Nested callback for second query
    db.query("SELECT * FROM orders WHERE user_id = ?", [userId],
      function(err, orders) {
        if (err) return sendError(res, err);
        // Another level of nesting...
        sendResponse(res, { user: result, orders: orders });
      }
    );
  }
);

// Stage 3: async/await (modern; syntactic sugar over Promises, the stage in between)
async function handleRequest(req, res) {
  const user  = await db.query("SELECT * FROM users WHERE id = ?", [req.userId]);
  const orders = await db.query("SELECT * FROM orders WHERE user_id = ?", [req.userId]);
  sendResponse(res, { user, orders });
}

The await keyword is syntactic sugar. When the runtime encounters await, it says: "I'm giving up control of the CPU to the event loop. Everything after this line is effectively a callback that will run when this IO operation completes."

Python's asyncio: the same idea

import asyncio

# `db` is assumed to be an asyncpg-style connection or pool created elsewhere.
async def handle_request(user_id: int) -> dict:
    # await = "yield control to the event loop until this IO completes"
    user = await db.fetchrow(
        "SELECT * FROM users WHERE id = $1", user_id
    )
    orders = await db.fetch(
        "SELECT * FROM orders WHERE user_id = $1", user_id
    )
    return {"user": user, "orders": orders}

# asyncio.run() creates an event loop and runs the coroutine to completion
# (uvloop can be installed as a faster drop-in loop implementation)
asyncio.run(handle_request(123))

Reference

MDN: async function
MDN: await operator
MDN: The event loop
Python: asyncio documentation

Async/Await as a State Machine

To truly understand async/await, it helps to see what the compiler or runtime actually transforms your code into. An async function is essentially converted into a state machine that interacts with the event loop. Each await is a transition point between states.

The original async function

async def fetch_user_data(user_id):
    user   = await db.get_user(user_id)      # State 0 → 1
    orders = await db.get_orders(user_id)    # State 1 → 2
    return {"user": user, "orders": orders}  # State 2 → done

What the runtime sees (conceptual)

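# Conceptual state machine representation (not real asyncio internals).
# Assumes db.get_user / db.get_orders return a promise-like object with a
# hypothetical .then(callback) method that calls callback(result) on completion.
def fetch_user_data_machine(user_id):
    state = 0
    user = None
    orders = None

    def step(result=None):
        nonlocal state, user, orders

        if state == 0:
            state = 1
            # Start the IO, register step() as the callback for when it completes
            return db.get_user(user_id).then(step)

        elif state == 1:
            user = result      # result of get_user
            state = 2
            return db.get_orders(user_id).then(step)

        elif state == 2:
            orders = result    # result of get_orders
            # A real runtime would hand this value to whoever awaited the coroutine
            return {"user": user, "orders": orders}

    return step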

Fig 6 — Each await is a state transition; the event loop runs other work in between

Two things this explains

1. Why await can only appear inside async functions. The function must be marked async because it needs to be transformed into a state machine. A regular function has no state machine infrastructure to support pausing and resuming.

2. Why blocking the event loop is catastrophic. If you perform a CPU-intensive operation inside a state (e.g. a 500 ms computation), the state machine can never transition. No other state machines (other requests) can advance. The entire server hangs.
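Python exposes enough of its coroutine machinery to watch the pausing and resuming directly. The sketch below plays the role of the event loop by hand: each send() resumes the coroutine until its next await. The io_operation helper is a hypothetical stand-in for a real awaitable, built with types.coroutine.

```python
import types

@types.coroutine
def io_operation(name):
    # Suspend the coroutine, handing `name` to whoever is driving it.
    result = yield name
    return result

async def fetch_user_data(user_id):
    user = await io_operation("get_user")      # State 0 -> 1
    orders = await io_operation("get_orders")  # State 1 -> 2
    return {"user": user, "orders": orders}    # State 2 -> done

coro = fetch_user_data(42)
pending = [coro.send(None)]                    # run until the first await
pending.append(coro.send({"id": 42}))          # resume with the "user" result
try:
    coro.send(["order-1"])                     # resume with the "orders" result
except StopIteration as finished:
    final = finished.value                     # the coroutine's return value
print(pending, final)
```

Between any two send() calls the coroutine is simply parked with its local variables intact, which is the window in which an event loop runs other work.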

Goroutines & Virtual Threads

Go takes a third approach that is neither raw OS threads nor a single-threaded event loop. It uses goroutines — lightweight, user-space threads managed by Go's own runtime scheduler. The term "virtual thread" captures the idea: they behave like threads (blocking code, sequential logic), but they're far cheaper because they're not OS-level constructs.

Why goroutines are cheap

Property	OS Thread	Goroutine
Stack size	~8 MB (virtual), ~KB physical	~2–8 KB initial (grows dynamically)
Creation time	~μs to ms (system call)	~ns to μs (user-space allocation)
Context switch	1–10 μs (kernel mode)	~200 ns (user-space pointer swap)
Practical limit	~10,000 (memory bottleneck)	~1,000,000+ (depending on RAM)
Scheduled by	OS kernel	Go runtime scheduler

Go's HTTP server: one goroutine per request

Go's standard library net/http starts a new goroutine for every incoming connection; with HTTP/1.x, each request on that connection is handled in that goroutine. This sounds like the thread-per-request model that we just said doesn't scale, but goroutines are so lightweight that it works well in practice. Here is a simplified sketch of the accept loop in Go's net/http package:

// Simplified from the accept loop in Go's net/http/server.go.
// handleConn is a stand-in name: the real code wraps the connection
// in an internal conn value and runs its serve method in the goroutine.
func (srv *Server) Serve(l net.Listener) error {
    for {
        conn, err := l.Accept()   // Accept new TCP connection
        if err != nil {
            return err
        }
        // Create a NEW goroutine for each connection
        go srv.handleConn(conn)
    }
}

The go keyword spawns a new goroutine. This is the equivalent of creating a thread — but at a fraction of the cost (nanoseconds vs. microseconds, kilobytes vs. megabytes).

A typical Go handler

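// db is assumed to be a package-level *sql.DB opened elsewhere.
func handleGetUser(w http.ResponseWriter, r *http.Request) {
    userID := r.URL.Query().Get("id")

    var user struct {
        Name  string `json:"name"`
        Email string `json:"email"`
    }

    // This BLOCKS the goroutine, but not the OS thread.
    // The Go scheduler parks this goroutine and runs another.
    // QueryRow returns a single *sql.Row; Scan reports any error.
    err := db.QueryRow(
        "SELECT name, email FROM users WHERE id = $1",
        userID,
    ).Scan(&user.Name, &user.Email)
    if err != nil {
        http.Error(w, "not found", http.StatusNotFound)
        return
    }

    // Goroutine resumes here after DB responds
    json.NewEncoder(w).Encode(user)
}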

Notice: no async, no await, no callbacks. The code reads like synchronous, blocking code — because the goroutine does block. But the Go runtime scheduler transparently parks it and picks up another goroutine. You get the ergonomics of blocking code with the efficiency of non-blocking IO.

The M:N Scheduler

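Go's scheduler uses an M:N model: it multiplexes many goroutines onto a small number of OS threads. The GOMAXPROCS setting (an environment variable, also adjustable via runtime.GOMAXPROCS; by default the number of available CPU cores) caps how many OS threads execute Go code at the same time. Threads blocked in system calls do not count against that limit. Each of the GOMAXPROCS logical processors keeps a local run queue of goroutines.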

Fig 7 — Go multiplexes many goroutines across a fixed pool of OS threads

When a goroutine blocks on network IO (e.g., a database read), the runtime parks it and the OS thread moves on to another runnable goroutine. The Go runtime's internal network poller (built on epoll/kqueue under the hood) detects when the IO completes and puts the parked goroutine back on a run queue. This is the best of both worlds: blocking-style code with event-loop-level efficiency.

Reference

Go: Effective Go (Goroutines)
Go: Effective Go (Concurrency)
Go Blog: Go Concurrency Patterns: Pipelines

Race Conditions & Shared State

Concurrency introduces an entire category of bugs that don't exist in sequential code. Almost all of them trace back to one root cause: shared mutable state. When two concurrent tasks read and write the same variable, and their operations interleave in unexpected ways, you get a race condition.

The classic counter example

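Two threads each try to increment a shared counter that starts at 0, so the expected final value is 2. Incrementing requires three CPU steps: read the current value, add one, write it back. If the threads interleave so that both read 0 before either writes, both compute 1 and both write 1. The final value is 1, and one increment is lost: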

Fig 8 — Two threads read the same stale value, and one update overwrites the other
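Real races depend on timing and are hard to reproduce on demand, so the sketch below replays the interleaving deterministically instead. Each "thread" is a generator that pauses between the read, add, and write steps, and a hand-written schedule decides who runs next. This is a simulation of the idea, not actual threading.

```python
def increment(shared, trace, name):
    """One increment, split into the three steps a CPU performs."""
    value = shared["counter"]            # 1. read
    trace.append(f"{name} read {value}")
    yield
    value += 1                           # 2. add one
    yield
    shared["counter"] = value            # 3. write back
    trace.append(f"{name} wrote {value}")
    yield

def run(schedule):
    shared, trace = {"counter": 0}, []
    threads = {"T1": increment(shared, trace, "T1"),
               "T2": increment(shared, trace, "T2")}
    for name in schedule:
        next(threads[name])              # advance that "thread" by one step
    return shared["counter"], trace

sequential, _ = run(["T1", "T1", "T1", "T2", "T2", "T2"])
interleaved, trace = run(["T1", "T2", "T1", "T2", "T1", "T2"])
print(sequential, interleaved, trace)
```

The sequential schedule ends at 2, while the interleaved one ends at 1: both "threads" read 0, so the second write overwrites the first.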

Race conditions in async/await too

You might think: "I use JavaScript / Python with async/await — there's only one thread, so no race conditions." Wrong. Race conditions can still occur between await points. Consider this Python example:

import asyncio

balance = 100

async def process_withdrawal(amount: int):
    await asyncio.sleep(0.01)   # stand-in for a payment API or DB call

async def withdraw(amount: int):
    global balance
    if balance >= amount:               # Check at time T1
        await process_withdrawal(amount)  # ← yields control here!
        balance -= amount                 # Deduct at time T2

async def main():
    # Both coroutines see balance=100, both pass the check,
    # both deduct 100 → balance = -100 (invalid!)
    await asyncio.gather(
        withdraw(100),
        withdraw(100),
    )
    print(balance)  # -100

asyncio.run(main())

The issue: the if check and the deduction are separated by an await. Between those two points, the event loop runs the second withdraw call, which also sees balance == 100 and passes the check. Both deduct, and the balance goes to -100.

Key Takeaway

Race conditions happen any time there is a check-then-act pattern separated by a yield point (await) or a context switch. Single-threaded async code is not immune to race conditions.

Locks, Mutexes & Channels

The solutions to race conditions have been studied for decades. Here are the major approaches.

Mutex (Mutual Exclusion)

A mutex ensures that only one thread/goroutine can execute a critical section at a time. All others must wait until the lock is released. This is the most common synchronization primitive across all languages.

Python — threading.Lock

import threading

counter = 0
lock = threading.Lock()

def increment():
    global counter
    with lock:          # Acquire lock — blocks other threads
        counter += 1   # Only one thread executes this at a time
                        # Lock is released automatically at end of `with`

# Create 1000 threads all incrementing the same counter
threads = [threading.Thread(target=increment) for _ in range(1000)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # Always 1000, no lost updates

Go — sync.Mutex

package main

import (
    "fmt"
    "sync"
)

func main() {
    var counter int
    var mu sync.Mutex
    var wg sync.WaitGroup

    for i := 0; i < 1000; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            mu.Lock()         // Acquire — other goroutines wait here
            counter++
            mu.Unlock()       // Release — next goroutine can proceed
        }()
    }

    wg.Wait()
    fmt.Println(counter) // Always 1000
}

Channels: Go's preferred approach

Go has a famous proverb: "Don't communicate by sharing memory; share memory by communicating." Instead of multiple goroutines writing to a shared variable (which needs a mutex), you can use channels — typed pipes through which goroutines send and receive values. Only the goroutine that owns the data modifies it; others send messages to request changes.

package main

import "fmt"

// counterService owns counter; no other goroutine touches it.
func counterService(ch <-chan int, done chan<- int) {
    counter := 0
    for delta := range ch { // Loop ends once ch is closed and drained
        counter += delta    // Only this goroutine modifies counter
    }
    done <- counter // Report the final total
}

func main() {
    ch := make(chan int, 100) // Buffered channel
    done := make(chan int)

    go counterService(ch, done)

    // Send 1000 increments through the channel
    for i := 0; i < 1000; i++ {
        ch <- 1
    }
    close(ch) // Signal that no more increments are coming

    result := <-done
    fmt.Println(result) // 1000: no race condition, no mutex needed
}

Python asyncio.Lock for async code

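import asyncio

balance = 100
lock = asyncio.Lock()   # module-level creation is fine on Python 3.10+

async def process_withdrawal(amount: int):
    await asyncio.sleep(0.01)   # stand-in for a payment API or DB call

async def withdraw(amount: int):
    global balance
    async with lock:                     # Only one coroutine at a time
        if balance >= amount:
            await process_withdrawal(amount)
            balance -= amount

async def main():
    # Now safe: the second withdraw waits for the lock
    await asyncio.gather(
        withdraw(100),
        withdraw(100),
    )
    print(balance)  # 0, not -100

asyncio.run(main())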

Reference

Python: threading.Lock
Python: asyncio.Lock
Go: The Go Memory Model
Go: Effective Go (Channels)
Go Blog: Share Memory by Communicating

Choosing the Right Model

Here's the practical summary — when to use what, and why.

Workload	Best Model	Why	Examples
IO-Bound, high concurrency	Event loop / async-await / goroutines	Minimal per-task overhead, little or no kernel context switching, handles 100K+ connections	Web servers, API gateways, microservice proxies, chat servers
CPU-Bound, parallelizable	OS threads or processes across multiple cores	True simultaneous execution; throughput scales with core count until serial sections or shared resources become the limit	Image processing, video encoding, ML inference, encryption
Mixed IO + CPU	Goroutines (Go) or worker pools + async (Python/JS)	IO handled concurrently; CPU work offloaded to thread/process pools	Most real-world backends — API calls + some computation

The Bottom Line

Concurrency keeps your program productive — it stops the CPU from idling while waiting for IO. Use async/await, goroutines, or virtual threads for this.

Parallelism lets you use multiple CPU cores to do multiple computations at the same instant. Use OS threads or process pools for this.

Most backend applications are overwhelmingly IO-bound — the bottleneck is almost always the network, not the CPU. But when you do hit CPU-bound work, parallelism is the tool that unlocks more throughput.

Language-specific concurrency primitives

Language	IO Concurrency	CPU Parallelism	Underlying Mechanism
Go	`go func()` (goroutines)	Goroutines across cores (automatic)	M:N scheduler + `epoll`/`kqueue`
Python	`asyncio` + `async/await`	`multiprocessing` / `concurrent.futures`	Event loop + process pool (GIL limits threads)
JavaScript	`async/await` / Promises	`worker_threads` / `cluster`	libuv event loop
Java (21+)	Virtual Threads (`Thread.ofVirtual()`)	Platform Threads + ForkJoinPool	Continuation-based M:N scheduling
Rust	`async/await` + Tokio runtime	`std::thread` / Rayon	Multi-threaded work-stealing async runtime
